- Introduce ourselves - name, research area, what you want from this course
- The course web page has links to all code and presentations
21.11.2023
Why statistics at all? What is the goal of statistical analysis?
What is the probability that my model is correct given what I already know about it and what I’ve learned?
Imagine a box with a total area of 1, representing all possible events
If \(A\) and \(B\) never occur together, the box splits into three regions:
| | !B | B |
|---|---|---|
| !A | 1 - pr(A) - pr(B) | pr(B) |
| A | pr(A) | 0 |
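If \(A\) and \(B\) can occur together, with joint probability \(pr(A,B)\), all four regions are occupied:
| | !B | B |
|---|---|---|
| !A | 1 - pr(A) - pr(B) + pr(A,B) | pr(B) - pr(A,B) |
| A | pr(A) - pr(A,B) | pr(A,B) |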
\[pr(A,B) = pr(A|B)pr(B)\]
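As a quick numeric check of the box and the product rule, here is a minimal R sketch. The values of `pr_A`, `pr_B`, and `pr_AB` are made up purely for illustration; they do not come from the course material.

```r
# Hypothetical probabilities, chosen only for illustration
pr_A  <- 0.30   # pr(A)
pr_B  <- 0.20   # pr(B)
pr_AB <- 0.05   # pr(A,B), the overlap of A and B

# The four regions of the box (rows: !A, A; columns: !B, B)
box <- matrix(
  c(1 - pr_A - pr_B + pr_AB, pr_B - pr_AB,
    pr_A - pr_AB,            pr_AB),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("!A", "A"), c("!B", "B"))
)

sum(box)   # all possible events: the total area is 1

# Conditioning on B keeps only the B column and rescales it
pr_A_given_B <- box["A", "B"] / sum(box[, "B"])

# Product rule: pr(A,B) = pr(A|B) pr(B)
all.equal(pr_A_given_B * pr_B, pr_AB)
```

Conditioning is just "shrink the box to the part where B happened, then ask how much of that part also contains A".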
Can anyone define probability?
Are you a (latent) zombie?
The problem:
People are turning into zombies! We have a test, but it is imperfect, with a false positive rate = 1% and a false negative rate = 0.5%.
You take the test, and the result is positive. What is the probability that you are actually going to become a zombie?
She grabs a shotgun…
Hopefully (for the sake of your health), this is unsatisfying… but why?
Let’s add some information: We also learn that 0.1% of the population is infected.
Exercise: Use the probability laws we know to compute the probability that you are a zombie.
Intuitively: the test is good, so the probability that an individual who tests positive is a zombie should be high
(many people answer 99%, given the false positive rate of 1%).
Unintuitively: zombies are very rare, so when many people are tested at random, most positive tests will be false positives.
Desired outcome: \(pr(Z | T)\)
(if I test positive, what is the probability I am a zombie?)
Start with a population of 1,000,000 people:
| | Test+ | Test- | Sum |
|---|---|---|---|
| Zombie | – | – | – |
| Not Zombie | – | – | – |
| Sum | – | – | 1,000,000 |
0.1% of the population is infected, so 1,000 of the 1,000,000 people are zombies:
| | Test+ | Test- | Sum |
|---|---|---|---|
| Zombie | – | – | 1,000 |
| Not Zombie | – | – | 999,000 |
| Sum | – | – | 1,000,000 |
The false negative rate is 0.5%, so 5 of the 1,000 zombies test negative and 995 test positive:
| | Test+ | Test- | Sum |
|---|---|---|---|
| Zombie | 995 | 5 | 1,000 |
| Not Zombie | – | – | 999,000 |
| Sum | – | – | 1,000,000 |
The false positive rate is 1%, so 9,990 of the 999,000 non-zombies test positive:
| | Test+ | Test- | Sum |
|---|---|---|---|
| Zombie | 995 | 5 | 1,000 |
| Not Zombie | 9,990 | 989,010 | 999,000 |
| Sum | 10,985 | 989,015 | 1,000,000 |
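The counts in the completed table can be reproduced in a few lines of R. This is only a sketch under the numbers stated in the problem (prevalence 0.1%, false positive rate 1%, false negative rate 0.5%) and a population of 1,000,000; all variable names are illustrative.

```r
n_pop      <- 1e6     # population size used in the table
prevalence <- 0.001   # 0.1% infected
fpr        <- 0.01    # false positive rate, pr(T | Z')
fnr        <- 0.005   # false negative rate, pr(T' | Z)

n_zombie <- n_pop * prevalence
n_human  <- n_pop - n_zombie

counts <- matrix(
  c(n_zombie * (1 - fnr), n_zombie * fnr,
    n_human * fpr,        n_human * (1 - fpr)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Zombie", "Not Zombie"), c("Test+", "Test-"))
)

# Add the row and column sums
totals <- cbind(counts, Sum = rowSums(counts))
totals <- rbind(totals, Sum = colSums(totals))
totals

# Among everyone who tests positive, the fraction who are zombies
frac_zombie <- counts["Zombie", "Test+"] / sum(counts[, "Test+"])
frac_zombie   # about 0.0906
```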
The positive test is a given. This shrinks our world of possibilities to the Test+ column.
| | Test+ |
|---|---|
| Zombie | 995 |
| Not Zombie | 9,990 |
| Sum | 10,985 |
0.1% of the population is infected with a parasite that will turn them into zombies.
false negative rate = 0.5%
false positive rate = 1%
\(pr(T | Z) = 1 - pr(T' | Z) = 1 - 0.005 = 0.995\)
\(pr(T' | Z') = 1 - pr(T | Z') = 1 - 0.01 = 0.99\)
\(pr(Z,T) = pr(T|Z)pr(Z) = 0.995 \times 0.001 = 0.000995\)
\(pr(Z|T) = \frac{pr(T|Z)pr(Z)}{pr(T)}\)
\[ \begin{aligned} pr(T) & = pr(T,Z) + pr(T,Z') \\ & = pr(T|Z)pr(Z) + pr(T|Z')pr(Z') \\ & = 0.995 \times 0.001 + 0.01 \times 0.999 \\ & = 0.000995 + 0.00999 \\ & = 0.010985 \end{aligned} \]
\[ \begin{aligned} pr(Z|T) & = \frac{pr(T|Z)pr(Z)}{pr(T)} \\ & = \frac{0.995 \times 0.001}{0.010985} \\ & = 0.0906 \end{aligned} \]
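The same calculation can be wrapped in a small R function. This is a minimal sketch; the function name `posterior_positive` and its argument names are illustrative, not part of the course code.

```r
# Posterior probability of being a zombie given a positive test.
# prior = pr(Z), fpr = pr(T | Z'), fnr = pr(T' | Z)
posterior_positive <- function(prior, fpr, fnr) {
  sens <- 1 - fnr                            # pr(T | Z)
  pr_T <- sens * prior + fpr * (1 - prior)   # pr(T), summing over Z and Z'
  sens * prior / pr_T                        # pr(Z | T), Bayes' theorem
}

posterior_positive(prior = 0.001, fpr = 0.01, fnr = 0.005)   # about 0.0906

# The same test applied where zombies are more common
posterior_positive(prior = c(0.001, 0.01, 0.1, 0.5), fpr = 0.01, fnr = 0.005)
```

Because the function is vectorised over `prior`, the second call shows how strongly the answer depends on how common zombies are, even though the test itself does not change.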
She grabs a shotgun…
Hopefully (for the sake of your health), this is unsatisfying… but why?
Can anyone define probability?
The zombie example is cute, but it reflects a real biological problem: the “true” state is often hidden, and we only have an imperfect signal.
Desired outcome: presence/absence of an endangered species
Imperfect indicator (expert observation)
Desire to know \(pr(present | observed)\)
True positive: \(pr(P|O)\)
False positive: \(pr(P'|O)\)
False negative: \(pr(P|O')\)
True negative: \(pr(P'|O')\)
| | Observed | Not Observed |
|---|---|---|
| Present | True positive | False negative |
| Absent | False positive | True negative |
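To make the four quantities concrete, here is a hedged R sketch that fills in the table for one invented scenario. The prior probability of presence, the detection probability, and the mistaken-observation probability are all hypothetical numbers chosen for illustration.

```r
# All numbers below are hypothetical, for illustration only
pr_P          <- 0.2    # prior probability that the species is present
pr_O_given_P  <- 0.6    # probability of an observation when present
pr_O_given_Pn <- 0.05   # probability of a (mistaken) observation when absent

# Joint probabilities for the four cells, e.g. pr(P,O) = pr(O|P) pr(P)
joint <- matrix(
  c(pr_O_given_P  * pr_P,       (1 - pr_O_given_P)  * pr_P,
    pr_O_given_Pn * (1 - pr_P), (1 - pr_O_given_Pn) * (1 - pr_P)),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Present", "Absent"), c("Observed", "Not Observed"))
)

# Condition on the observation outcome: divide each column by its sum.
# The cells are then pr(P|O), pr(P|O') in the first row
# and pr(P'|O), pr(P'|O') in the second row.
posterior <- sweep(joint, 2, colSums(joint), "/")
posterior
```

Each column of `posterior` sums to 1, because once we know whether the species was observed, it must be either present or absent.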
\[ \begin{aligned} pr(A,B) & = pr(A|B)pr(B) \\ \end{aligned} \]
\[ \begin{aligned} pr(A,B,C) & = pr(A|B,C)pr(B,C) \\ & = pr(A|B,C)pr(B|C)pr(C) \end{aligned} \]
\[ \begin{aligned} pr(\bigcap_{k=1}^{n} A_k) & = pr(A_n | \bigcap_{k=1}^{n-1} A_k )pr(\bigcap_{k=1}^{n-1} A_k) \\ & =\prod_{k=1}^{n}pr(A_k | \bigcap_{j=1}^{k-1}A_j) \end{aligned} \]
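A small worked case of the chain rule, sketched in R. The card-drawing example is an illustration added here and is not from the course material: the probability of drawing three aces in a row, without replacement, from a standard 52-card deck.

```r
# Chain rule: pr(A1, A2, A3) = pr(A1) pr(A2 | A1) pr(A3 | A1, A2)
pr_A1          <- 4 / 52   # first card is an ace
pr_A2_given_A1 <- 3 / 51   # second is an ace, given the first was
pr_A3_given_12 <- 2 / 50   # third is an ace, given the first two were

chain <- pr_A1 * pr_A2_given_A1 * pr_A3_given_12

# Cross-check by counting: ways to pick 3 of the 4 aces,
# divided by ways to pick any 3 of the 52 cards
counting <- choose(4, 3) / choose(52, 3)

all.equal(chain, counting)
```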
\[pr(k = 10 | n = 10, p = 0.3) = 0.3^{10} \approx 5.9 \times 10^{-6} \]
\[pr(Z_1,Z'_{2..10}) = 0.3 \times0.7^9 \approx 0.012 \]
\[pr(k=1|n=10,p=0.3) = 10 \times 0.3 \times0.7^9 \approx 0.121\]
\[pr(Z_{a}, Z'_{a'}) = p^k(1 - p)^{(n - k)}\]
\[pr(k|n,p) = {n \choose k} p^k(1-p)^{(n-k)}\]

    choose(n = 10, k = 0:10)
    ## [1] 1 10 45 120 210 252 210 120 45 10 1
    round(dbinom(0:10, 10, 0.3), 3)
    ## [1] 0.028 0.121 0.233 0.267 0.200 0.103 0.037 0.009 0.001 0.000 0.000

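To connect the formula to `dbinom`, the PMF can also be computed by hand. A minimal sketch, using the same `n = 10` and `p = 0.3` as above (the name `pmf_by_hand` is illustrative):

```r
n <- 10
p <- 0.3
k <- 0:n

# Number of orderings times the probability of any one ordering
pmf_by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)

all.equal(pmf_by_hand, dbinom(k, n, p))   # same values as dbinom
sum(pmf_by_hand)                          # a PMF sums to 1 over all k
```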
This is the probability mass function (PMF) of the binomial distribution (dbinom in R) \[pr(k|n,p) = {n \choose k} p^k(1-p)^{(n-k)}\]
What is the probability of observing \(k\) events out of \(n\) independent trials, when each trial produces an event with probability \(p\)?
What is the probability of observing \(\le k\) events? Cumulative distribution function (CDF)
\[ pr(X \le k|n,p) = \sum_{i=0}^{k} {n \choose i}p^i(1-p)^{(n-i)} \]

    k = 0:10
    y = pbinom(k, 10, 0.3)
    round(y, 3)
    ## [1] 0.028 0.149 0.383 0.650 0.850 0.953 0.989 0.998 1.000 1.000 1.000
    round(sum(dbinom(0:2, 10, 0.3)), 3)
    ## [1] 0.383

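The CDF also gives upper-tail and interval probabilities. A short sketch with the same `n = 10` and `p = 0.3`; the variable names are illustrative.

```r
n <- 10
p <- 0.3

# pr(X > 2): complement of the CDF
upper_a <- 1 - pbinom(2, n, p)
upper_b <- pbinom(2, n, p, lower.tail = FALSE)   # same quantity, computed directly
all.equal(upper_a, upper_b)

# pr(3 <= X <= 5): difference of two CDF values
mid_cdf <- pbinom(5, n, p) - pbinom(2, n, p)
mid_pmf <- sum(dbinom(3:5, n, p))                # same, summing the PMF
all.equal(mid_cdf, mid_pmf)
```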
I invent a zombie detector that counts up every time a zombie walks past, and I put detectors out in busy parks. How many zombies does each detector count?

    # Poisson PMF and CDF for counts 0 to 20, with rate lam = 5
    lam = 5
    pois_dat = data.frame(x = 0:20)
    pois_dat$pmf = dpois(pois_dat$x, lam)
    pois_dat$cdf = ppois(pois_dat$x, lam)

The same quantities for several values of the rate `lam`:

    # One row per combination of count x and rate lam
    lam = c(0.5, 2, 5, 20)
    pois_dat = expand.grid(x = 0:50, lam = lam)
    pois_dat$pmf = dpois(pois_dat$x, pois_dat$lam)
    pois_dat$cdf = ppois(pois_dat$x, pois_dat$lam)

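One property worth checking numerically is that a Poisson distribution has mean and variance both equal to its rate. The sketch below computes them directly from the PMF for the same four rates; the range `0:200` and the name `pois_summary` are choices made for this illustration.

```r
lam <- c(0.5, 2, 5, 20)
x <- 0:200   # wide enough that the omitted upper tail is negligible here

pois_summary <- t(sapply(lam, function(l) {
  pmf <- dpois(x, l)
  m <- sum(x * pmf)             # mean: sum of x * pr(x)
  v <- sum((x - m)^2 * pmf)     # variance: expected squared deviation
  c(lambda = l, mean = m, variance = v)
}))
pois_summary
```

Each row should show the mean and variance matching `lambda`, which is why a single parameter controls both the location and the spread of the counts.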
Negative binomial distribution (dnbinom in R), with size parameter \(r\):
\[\mu = \frac{p \, r}{1-p}\]
\[ s^2 = \mu + \frac{\mu^2}{r} \]

    dat = expand.grid(x = 0:60, mu = c(10, 20), size = c(5, 2))
    dat$pmf = with(dat, dnbinom(x, mu = mu, size = size))
    dat$cdf = with(dat, pnbinom(x, mu = mu, size = size))

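The variance formula \(s^2 = \mu + \mu^2 / r\) can be checked numerically against `dnbinom`, with `size` playing the role of \(r\). A minimal sketch for one parameter pair from the grid above (`mu = 10`, `size = 2`); the range `0:5000` is a choice made for this illustration.

```r
mu   <- 10
size <- 2
x <- 0:5000   # wide enough that the omitted upper tail is negligible here

pmf <- dnbinom(x, mu = mu, size = size)
m <- sum(x * pmf)             # mean from the PMF
v <- sum((x - m)^2 * pmf)     # variance from the PMF

c(mean = m, variance = v, formula = mu + mu^2 / size)
```

A Poisson distribution with the same mean would have a variance of 10, so the extra `mu^2 / size` term is what allows the negative binomial to describe overdispersed counts.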
For a zombie detector in a park, how much time will pass between each zombie passing by the detector?

    lam = c(0.5, 2, 5, 20)
    dat = expand.grid(x = seq(0, 15, length.out = 100), lam = lam)
    dat$pdf = dexp(dat$x, dat$lam)
    dat$cdf = pexp(dat$x, dat$lam)

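The waiting-time question is tied to the count question: if zombies pass at a constant rate `lam`, waiting longer than some time for the next zombie is the same event as counting zero zombies in a window of that length. A hedged sketch, with the values of `lam` and `wait` picked for illustration:

```r
lam  <- 2     # zombies per unit time
wait <- 0.5   # a waiting time, in the same time units

# Probability of waiting longer than `wait` for the next zombie
pr_wait_longer <- pexp(wait, rate = lam, lower.tail = FALSE)

# Probability of a Poisson count of zero over a window of length `wait`
pr_zero_count <- dpois(0, lambda = lam * wait)

all.equal(pr_wait_longer, pr_zero_count)

# Mean waiting time is 1 / lam (numerical integration of x * pdf)
mean_wait <- integrate(function(x) x * dexp(x, rate = lam),
                       lower = 0, upper = Inf)$value
mean_wait
```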
Gamma distribution (dgamma, pgamma in R):

    dat = expand.grid(x = seq(0, 15, length.out = 100), shape = c(0.5, 4), rate = c(0.2, 2))
    dat$pdf = with(dat, dgamma(x, shape = shape, rate = rate))
    dat$cdf = with(dat, pgamma(x, shape = shape, rate = rate))

Normal distribution (dnorm, pnorm in R):

    dat = expand.grid(x = seq(-6, 6, length.out = 100), mu = 0, sd = c(0.2, 1, 2))
    dat$pdf = with(dat, dnorm(x, mu, sd))
    dat$cdf = with(dat, pnorm(x, mu, sd))

Beta distribution (dbeta, pbeta in R):

    dat = expand.grid(x = seq(0, 1, length.out = 100), alpha = c(0.5, 1, 2), beta = c(0.5, 1, 2))
    dat$pdf = with(dat, dbeta(x, alpha, beta))
    dat$cdf = with(dat, pbeta(x, alpha, beta))

- d functions in R (probability density): dnorm, dgamma, etc.
- p functions in R (cumulative probability): pnorm, pgamma, etc.
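The relationship between the two families can be sketched in a few lines of R: a `p` function accumulates the corresponding `d` function. The example also uses `qnorm`, which is not mentioned in the list above; it is included here only to show that R provides an inverse of the `p` functions as well.

```r
x <- 1.5

# Continuous case: p is the integral of d up to x
area <- integrate(dnorm, lower = -Inf, upper = x)$value
all.equal(area, pnorm(x), tolerance = 1e-6)

# Discrete case: p is the running sum of d
all.equal(cumsum(dpois(0:10, 5)), ppois(0:10, 5))

# q functions invert p: the quantile for a given cumulative probability
x_back <- qnorm(pnorm(x))
x_back   # recovers x, up to floating-point error
```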